Skip to content

gh-148613: Fix race in gc_set_threshold and gc_get_threshold#150356

Open
LindaSummer wants to merge 5 commits into
python:mainfrom
LindaSummer:gc_state_tsan
Open

gh-148613: Fix race in gc_set_threshold and gc_get_threshold#150356
LindaSummer wants to merge 5 commits into
python:mainfrom
LindaSummer:gc_state_tsan

Conversation

@LindaSummer
Copy link
Copy Markdown
Contributor

Issue

gh-148613

Root Cause

In free-threading, the gc_generation.threshold races in threads when one thread has objects triggered the GC.

if (gc->alloc_count >= LOCAL_ALLOC_COUNT_THRESHOLD) {
// TODO: Use Py_ssize_t for the generation count.
GCState *gcstate = &tstate->interp->gc;
_Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count);
gc->alloc_count = 0;
if (gc_should_collect(gcstate) &&

Inside gc_should_collect we read the gcstate->young.threshold and gcstate->old[0].threshold without thread syncing.

gc_should_collect(GCState *gcstate)
{
int count = _Py_atomic_load_int_relaxed(&gcstate->young.count);
int threshold = gcstate->young.threshold;
int gc_enabled = _Py_atomic_load_int_relaxed(&gcstate->enabled);
if (count <= threshold || threshold == 0 || !gc_enabled) {
return false;
}
if (gcstate->old[0].threshold == 0) {

At the same time, the threshold setting also has no syncing protection.

cpython/Modules/gcmodule.c

Lines 170 to 175 in cb72193

gcstate->young.threshold = threshold0;
if (group_right_1) {
gcstate->old[0].threshold = threshold1;
}
if (group_right_2) {
gcstate->old[1].threshold = threshold2;

This explains why a cyclic referenced object caused this TSAN report.
The cyclic object couldn't make ref count to zero in scoped call stack, and it increments the _gc_thread_state.alloc_count to LOCAL_ALLOC_COUNT_THRESHOLD.
Then the GC collect triggered in this thread and races with another thread's update of gc_generation.threshold.

Proposed Changes

Add relaxed atomic load/store protection for the gc_generation.threshold setter and getter.

@LindaSummer
Copy link
Copy Markdown
Contributor Author

Hi @picnixz,

Sorry to bother you.
Could you help take a review on this PR?

Wish you a good day!

@pablogsal
Copy link
Copy Markdown
Member

I think this fix can be correct but this doesn't really fix the underlying problem: we are reading unsynced values so this change doesn't give us any guarantees that the values are consistent with a concurrent call to set them. Indeed we could read half of them from one write and another half from another.

I think this is fine but perhaps we need exclusive access.

@nascheme what do you think?

@picnixz
Copy link
Copy Markdown
Member

picnixz commented May 25, 2026

I think so as well. It will only mitigate some cases. AFAIU, the problem is that we have 3 values to read or write but these values can be changed by another thread at any moment right? the correct fix should be to lock the entire GC state when writing or reading.

Note that get_gc_state() already requires the caller to hold the GIL, so we could just hold it for the entirety of each function?

@nascheme
Copy link
Copy Markdown
Member

The approach of "sprinkling in" relaxed atomics to silence TSAN warnings is not correct, IMO. They don't ensure ordering. I'd suggest we keep using relaxed atomics for enabled. gc_should_collect() should acquire a mutex, snapshot the fields to locals, release, then evaluate. The mutex is only needed on the slow path of record_allocation. Writers take the mutex: gc.set_threshold() in Modules/gcmodule.c, interp->gc.long_lived_total = state->long_lived_total. The mutex is only held across the snapshot/write, avoiding deadlocks.

@LindaSummer
Copy link
Copy Markdown
Contributor Author

The approach of "sprinkling in" relaxed atomics to silence TSAN warnings is not correct, IMO. They don't ensure ordering. I'd suggest we keep using relaxed atomics for enabled. gc_should_collect() should acquire a mutex, snapshot the fields to locals, release, then evaluate. The mutex is only needed on the slow path of record_allocation. Writers take the mutex: gc.set_threshold() in Modules/gcmodule.c, interp->gc.long_lived_total = state->long_lived_total. The mutex is only held across the snapshot/write, avoiding deadlocks.

Hi @nascheme , @picnixz and @pablogsal ,

Thanks very much for your review and suggestions!
I will follow the mutex way to ensure the threshold's consistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants